(54) Network management event correlation in environments containing inoperative network elements



(57) A network monitor (510) for distinguishing between broken and inaccessible network elements
(124, 128-136). The network monitor includes one or more computer readable storage mediums, and
computer readable program code stored in the one or more computer readable storage mediums. The
computer readable program code includes code for discovering the topology of a plurality of network
elements, code for periodically polling a plurality of network interfaces associated with the
plurality of network elements, code for computing or validating a criticalRoute attribute for each
of the plurality of network interfaces, and code for analyzing a status of network interfaces
identified by the criticalRoute attribute of an interface in question (IIQ) which is not responding
to a poll or ping. The computer readable program code may also include code for establishing a
slowPingList and placing in-memory representations of broken or failed network interfaces thereon,
thereby reducing the amount of information which is presented to a network administrator from
inaccessible elements not responding to a network interface poll. A means for correlating and/or
suppressing events (502) in response to the determination of whether a network interface is failed
or broken is also provided. Information which is not critical to a network administrator may be
suppressed, and then viewed in a "drill down" (522, 722, 822) of a particular network interface.
Description

Field of the Invention

[0001] The invention pertains to network management systems, and more particularly, to a technique for
distinguishing between broken network elements, and network elements that are inaccessible due to the broken
elements. Gathered event data is then presented to a network administrator in a simple, clear and manageable way.

Background of the Invention 

[0002] Network Management Systems like the OpenView Network Node Manager product are designed to discover
network topology (i.e., a list of all network elements in a domain, their type, and their connections), monitor the
health of each network element, and report problems to the network administrator. OpenView Network Node Manager
(NNM) is a product distributed by Hewlett-Packard Company of Palo Alto, California.

[0003] The monitoring function of such a system is usually performed by a specialized computer program which
periodically polls each network element and gathers data which is indicative of the network element's health. A
monitor program typically runs on a single host. However, in distributed networks, monitors may run on various
nodes in the network, with each monitor reporting its results to a centralized display.

[0004] A network administrator observes a presentation of network health on the display. Ideally, if a network
element fails, the information presented to the network administrator identifies the following: 1) which element is
malfunctioning; 2) which other network elements are impacted by the malfunction, that is, which functional network
elements are inaccessible over the network because of a failing device; and 3) which inaccessible network elements
are critical to the productivity of an organization relying on the network (thus, reestablishing their availability
is a high priority for the network administrator).

[0005] On many commercial network management products, these three distinct classes of information are
consolidated into one class. Because the failure of a single network element can result in thousands of elements
(nodes and interfaces) suddenly becoming inaccessible, the network administrator (NA) is overwhelmed with
information. As a result, it might take the NA considerable time to analyze the plethora of information received,
and determine the root cause of the failure and its impact on the organization.

[0006] When a network element fails and many additional nodes become inaccessible, a monitor will typically
continue to poll both the functioning nodes and the inaccessible nodes. Monitoring is typically done using ICMP
pings (Internet Control Message Protocol Echo Request), SNMP (Simple Network Management Protocol) messages, or IPX
diagnostic requests. These activities will subsequently be referred to as "queries" or "pings". When a network
element is accessible, these queries take on the order of milliseconds to process. However, when a network element
is inaccessible, a query can take seconds to time out.

[0007] This results in a flood of extraneous network traffic and, consequently, a network's performance degrades.
For example, the monitor program may run more slowly, to the point that it actually falls behind in its scheduled
polls of "functioning" nodes. This can lead to even further network degradation.
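
The effect described in paragraphs [0006] and [0007] can be illustrated with a rough calculation. The sketch below is not taken from any monitor implementation: the function name, the 5 ms reply time, the 2 s time-out and the assumption of strictly sequential polling are all illustrative choices.

```python
def poll_cycle_seconds(reachable, unreachable, reply_s=0.005, timeout_s=2.0):
    """Duration of one strictly sequential polling pass (assumed timings)."""
    return reachable * reply_s + unreachable * timeout_s


# 1000 interfaces, all answering, versus the same network with half unreachable.
healthy = poll_cycle_seconds(reachable=1000, unreachable=0)
degraded = poll_cycle_seconds(reachable=500, unreachable=500)
print(f"healthy pass: {healthy:.1f} s, degraded pass: {degraded:.1f} s")
```

Under these assumed numbers a pass grows from about five seconds to roughly a thousand, which is why a monitor that keeps polling inaccessible elements at the normal rate can fall behind on the elements that are still functioning.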

[0008] One product which attempts to solve the above problems is the NerveCenter product distributed by Seagate
Software of Scotts Valley, California. However, the NerveCenter product does not contain a monitor program. Results
are therefore achieved by forcing the NA to manually describe the network using a proprietary topology description
language. This task is impractical for networks of any practical size. Further, changes to the network mandate that
a NA make equivalent changes (manually) to the topology description.

[0009] Another product which attempts to solve the above problems is OpenView Network Node Manager, distributed
by Hewlett-Packard Company of Palo Alto, California. Releases of OpenView Network Node Manager prior to and
including version 5.01 (NNM 5.01) contain a monitor program called netmon, which monitors a network as described
supra. NNM 5.01 supports environments containing a single netmon, and also supports distributed environments
containing several netmon processes. In a distributed environment, a plurality of netmon processes run on various
Collection Station hosts, each of which communicates topology and status information to a centralized Management
Station (which runs on a different host in the network) where information is presented to the NA.

[0010] For ease of description, most of the following description is provided in the context of non-distributed
environments. FIG. 1 illustrates a small network 100 with netmon running on MGR HOST N 110 and accessing the
network 100 using network interface N.1 of MGR HOST N. Netmon discovers the network 100 using ICMP and SNMP and
stores the topology into the topology database 118 (topo DB) through services provided by the ovtopmd database
server 116. The ipmap/ovw processes 104 are interconnected 106 with ovtopmd 116, and convert topology information
into a graphical display 108 which shows all discovered network elements, their connections and their status.

[0011] Netmon determines the status of each network element 124, 128-136 by pinging them (e.g., using ICMP). If
a ping reply is returned by a particular network element 124, then the element is Up. Otherwise, the element 128 is
Down. If the element 124 is Up, then ipmap/ovw 104 will display the element as green (conveyed by an empty circle in
FIG. 1, 108, and in FIG. 3, 302). If the element 128 is Down it is displayed as red (conveyed by a filled circle in
FIG. 1, 108, and in FIG. 3, 304). It is also possible for a node or interface to have a status of Unknown and
displayed as blue (conveyed by a split circle in FIG. 3, 306-312). The cases where Unknown is used by a
conventional network monitor are rare.

[0012] In addition to the topology display, NNM contains an Event System 114 for communication of node status,
interface status and other information among NNM processes 120, 204 and 3rd party tools 206 (FIG. 2). These events
are displayed to the NA using the xnmevents.web Event Browser 120 tool (as a list of events 122 in chronological
order).

[0013] In FIG. 1, interface B.1 of node Router_B 128 has gone down, and has caused the nodes Router_B 128,
Bridge_C 130, X 132, Y 134 and Z 136 to suddenly become inaccessible. This causes the following events to be
emitted by netmon as it discovers that these nodes 128-136 and their interfaces are down.

Interface C.2 Down
Interface C.1 Down
Interface B.1 Down
Interface B.2 Down
Interface Z.1 Down
Interface Y.1 Down
Interface X.1 Down

[0014] Notice that the interface Down events are emitted in the random order that netmon polls the interfaces.
This adds to the NA's difficulty in determining the cause of a failure using the Events Browser. The status of each
node 124, 128-136 and interface is also displayed on the ovw screen 108. As previously stated, all inaccessible
nodes and interfaces are displayed in the color red (i.e., a filled circle).

[0015] In a real network, with thousands of nodes on the other side of Router_B 128, neither display (ovw 108 or
xnmevents.web 120) allows the NA to determine the cause of a failure and the urgency of reviving critical nodes in a
short amount of time. In addition, this system 100 suffers from the network performance degradations described
previously because netmon continues to poll inaccessible nodes 130-136.

[0016] It is therefore a primary object of this invention to present problems with network elements in a way that
clearly indicates the root cause of a problem, allowing a NA to quickly begin working on a solution to the problem.

[0017] Another object of this invention is to provide a system and method for distinguishing between broken and
inaccessible network elements.

[0018] An additional object of this invention is to provide a means of suppressing and/or correlating network
events so as to 1) reduce the glut of information received by a NA upon failure of a network element, and 2)
provide a means for the NA to view suppressed information in an orderly way.

[0019] It is a further object of this invention to provide a NA with a network monitor which is highly
customizable, thereby providing a number of formats for viewing information.

Summary of the Invention 

[0020] In the achievement of the foregoing objects, the inventors have devised a network monitor for
distinguishing between broken and inaccessible network elements. The network monitor comprises one or more computer
readable storage mediums (e.g., CD-ROM, floppy disk, magnetic tape, hard drive, etc.), and computer readable program
code stored in the one or more computer readable storage mediums. The computer readable program code comprises 1)
code for discovering the topology of a plurality of network elements, 2) code for periodically polling a plurality
of network interfaces associated with the plurality of network elements, 3) code for computing or validating a
criticalRoute attribute for each of the plurality of network interfaces, and 4) code for analyzing a status of
network interfaces identified by the criticalRoute attribute of an interface in question (IIQ) which is not
responding to a poll. Elements of the code for discovering the topology of a plurality of network elements and the
code for periodically polling a plurality of network interfaces associated with the plurality of network elements
are disclosed in United States patent 5,185,860 of Wu entitled "Automatic Discovery of Network Elements", and in
United States patent 5,276,789 of Besaw et al. entitled "Graphic Display of Network Topology". Both of these
patents are hereby incorporated by reference for all that they disclose.
[0021] The above described network monitor (and systems and methods for using same) provide many advantages
over previous network monitor implementations.

[0022] A first advantage is automatic topology. To properly identify the root cause of a network failure requires
1) input from a topological model of the network, and 2) the current status of each element on the network.
Previously, the NA was required to describe this topology manually. The design disclosed herein uses topology and
status information that has already been created.
[0023] A second advantage is topological display. Topological graphical display of information is much more
useful in aiding a network administrator in his or her effort to identify the root cause of a network failure and
to set his or her priorities. The graphical display (ovw) used in the preferred embodiment of the invention clearly
presents network element status in 3 categories:

• Functioning nodes and interfaces are displayed in green to indicate Up status.

• Root cause failures and inaccessible interfaces on critical server nodes are displayed in red to indicate Down
status.

• Non-critical, inaccessible interfaces are displayed in blue to indicate Unknown status.

[0024] A third advantage is a new manner of displaying events. The Event Browser's 120 (FIG. 1) display of node
and interface status information 122 is much more useful in aiding the network administrator in his or her effort
to identify the root cause of a network failure and to set priorities. Secondary failure events (i.e., events
indicating inaccessible interfaces) are not displayed with primary failures (i.e., events indicating truly failed
interfaces) and inaccessible critical nodes. Secondary failures are viewable via a "drill down" process.

[0025] A fourth advantage is network performance. With past designs, network performance degrades when a failure
occurs because many more network management messages are emitted due to failed queries for inaccessible network
elements. In addition, the network monitoring processes get behind schedule because failures result in time-outs
which are much slower than successful queries. This design includes backoff polling algorithms for the network
monitoring service so as to avoid these situations whenever possible.

[0026] A fifth advantage is an ability to classify network elements. Network elements are broken into two
classes, regular and critical, to help the system display information to the NA in a way that reflects his or her
priorities. This is accomplished by providing a mechanism for the network administrator to define a "filter" which
describes routers, important servers and other network elements that the NA considers important.

[0027] A sixth advantage is scalability. Other systems are centralized in architecture, and not scalable to large
enterprise networks. This implementation builds on the OpenView distributed architecture and delivers new
functionality for "large" networks.

[0028] A seventh advantage is an ability to handle arbitrary topologies. Algorithms that attempt to find a root
cause of a network element failure are easily fooled because of the complexity in customer network configurations
(e.g., they may contain loops and dynamic routing). Algorithms in this implementation are "Bottoms up" and will not
be fooled into thinking a network element is down when a redundant router fails.

[0029] An eighth advantage is that the system is extremely configurable. A network administrator is therefore
allowed to make trade-offs that optimize his or her network, working style, and performance. Other systems tend to
be manually configurable and/or rigid.

[0030] A last advantage is "event ordering." During network failure situations, confusion is compounded by
implementations that discover inaccessible network elements in random order. This implementation contains new
queuing algorithms to discover failures in a predictable order. This predictability is helpful to both the network
administrator, and other event correlation processes that a user or third parties might construct.

[0031] These and other important advantages and objectives of the present invention will be further explained in,
or will become apparent from, the accompanying description, drawings and claims.

Brief Description of the Drawings 

[0032] An illustrative and presently preferred embodiment of the invention is illustrated in the drawings in which:

FIG. 1 is a block diagram of an NNM 5.01 network administrator display;
FIG. 2 is a block diagram of an NNM 5.01 event distribution system;
FIG. 3 is a graphical display of network element health;
FIG. 4 is a block diagram of a preferred event distribution system;
FIG. 5 is a block diagram of a preferred network administrator display in a [< ' ), Down, Unknown, True] configuration;
FIG. 6 is a flow chart illustrating the operation of an ECS router down circuit;
FIG. 7 is a block diagram of a preferred network administrator display in a [(serverFilter), Down, Unknown, True]
configuration; and
FIG. 8 is a block diagram of a preferred network administrator display in a [(serverFilter), Unknown, Ignore, True]
configuration.
Description of the Preferred Embodiment

[0033] Network management apparatus for distinguishing between broken and inaccessible network elements in a
network, and for presenting this information to a network administrator in an easy to comprehend format, is shown
in FIGS. 5, 7 & 8. The apparatus generally comprises a display process 104, 120 and a network monitor 110, connected
by way of one or more event buses 114. The network monitor 110 comprises a means for discovering the topology of a
plurality of network elements 124, 128-136 connected thereto, means for periodically polling a plurality of network
interfaces associated with the plurality of network elements 124, 128-136, means for computing or validating a
criticalRoute attribute for each of the plurality of network interfaces, and means for analyzing the status of
network interfaces (e.g., N.1, A.1, A.2, B.1, B.2, C.1, C.2, X.1, Y.1, Z.1) identified by the criticalRoute
attribute of an interface in question (IIQ) which is not responding to a poll.

[0034] Likewise, a computer implemented method of distinguishing between broken and inaccessible network
elements, and for presenting this information to a network administrator in an easy to comprehend format, may
comprise the steps of 1) discovering the topology of a plurality of network elements 124, 128-136, 2) periodically
polling a plurality of network interfaces associated with the plurality of network elements 124, 128-136, 3)
computing or validating a criticalRoute attribute for each of the plurality of network interfaces, and 4) analyzing
the status of network interfaces identified by the criticalRoute attribute of an interface in question (IIQ) which
is not responding to a poll.

[0035] Having described a method and apparatus for distinguishing between broken and inaccessible network
elements in general, the method and apparatus will now be described in further detail.

[0036] The preferred embodiment of the invention is designed to work in conjunction with the OpenView Network
Node Manager product for stand-alone and distributed environments (hereinafter "NNM"). The OpenView Network
Node Manager is a product distributed by Hewlett-Packard Company of Palo Alto, California. This product is described
in detail in a number of End User manuals identified by HP part numbers J1136-90000, J1136-90001, J1136-90002,
J1136-90004, and J1136-90005, and a number of Developer Manuals identified by HP part numbers J1150-90001,
J1150-90002, J1150-90003, and J1150-90005. These manuals are hereby incorporated by reference for all that they
disclose.

[0037] The first part of this description will discuss how the system operates in a non-distributed environment. A
non-distributed environment is an environment composed of one Management Station 110 and no Collection Stations,
with one NNM netmon process running on the Management Station 110. The ovw and ovevents.web NA display runs on the
Management Station 110.

The CriticalRoute Attribute 

[0038] Netmon discovers the topology of a network 100 exactly as in NNM 5.01. However, during netmon's
configuration poll (which typically happens once per day) and during the first status poll of each node (after
netmon begins running), netmon will compute or validate a per network interface attribute called criticalRoute.

[0039] The criticalRoute attribute is a sequence of ovtopmd DB object identifiers which correspond to the route
that a network packet could take if sent from netmon to a particular interface. The criticalRoute attribute traces
the path of the intervening network interfaces.
[0040] The following list enumerates the criticalRoute values for each network interface in FIG. 1.

Network Interface    Critical Route
N.1                  N.1
A.1                  N.1,A.1
A.2                  N.1,A.1,A.2
B.1                  N.1,A.1,A.2,B.1
B.2                  N.1,A.1,A.2,B.1,B.2
C.1                  N.1,A.1,A.2,B.1,B.2,C.1
C.2                  N.1,A.1,A.2,B.1,B.2,C.1,C.2
X.1                  N.1,A.1,A.2,B.1,B.2,C.1,C.2,X.1
Y.1                  N.1,A.1,A.2,B.1,B.2,C.1,C.2,Y.1
Z.1                  N.1,A.1,A.2,B.1,B.2,C.1,C.2,Z.1


[0041] CriticalRoute can be computed even if the network contains loops because at the moment the computation is
performed, there is only one route that a packet would take. When calculating criticalRoute when multiple
possibilities exist, precedence is given to routes within the same network or subnet over routes which trace outside
of the network. Precedence is also given to routes containing router nodes over other non-router "multi-homed" nodes.
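
As an illustration of what a criticalRoute value contains, the sketch below derives routes for the FIG. 1 example by a breadth-first search from the monitor's own interface. This is only one way to obtain such a path: the adjacency table `LINKS` and the function `critical_route` are hypothetical names, the search strategy is an assumption rather than the method netmon uses, and the precedence rules of paragraph [0041] are not modeled.

```python
from collections import deque

# Hypothetical interface-to-interface adjacency modeled on the FIG. 1 example.
LINKS = {
    "N.1": ["A.1"],
    "A.1": ["A.2"],
    "A.2": ["B.1"],
    "B.1": ["B.2"],
    "B.2": ["C.1"],
    "C.1": ["C.2"],
    "C.2": ["X.1", "Y.1", "Z.1"],
}


def critical_route(target, start="N.1", links=LINKS):
    """Return the interfaces traversed from start to target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:  # the seen set keeps the search finite on loops
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(",".join(critical_route("B.2")))
```

For this loop-free example the search reproduces the rows of the table in paragraph [0040]; a fuller model would need the subnet and router precedence rules to choose among multiple candidate paths.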

primaryFailures vs. secondaryFailures

[0042] Having calculated a criticalRoute attribute for each network interface, it is then possible to distinguish
primaryFailure interfaces from secondaryFailure interfaces. As used herein, a primaryFailure interface is one that
has failed, while a secondaryFailure interface is one that is inaccessible due to a failed interface.

[0043] Assume that netmon polls network interfaces in the same order as the events are displayed in the Event
Browser 120 in FIG. 1. That is:

Interface C.2
Interface C.1
Interface B.1
Interface B.2
Interface Z.1
Interface Y.1
Interface X.1

[0044] Prior to the status poll of node BRIDGE_C's interfaces, all nodes 124, 128-136 and interfaces are Up and
displayed in the color green on the ovw map 104/108. No interface Down events are in the Events Browser 120. When
netmon's ping of interface C.2 times-out, netmon knows that interface C.2 is inaccessible. It does not yet know if
interface C.2 is inaccessible because it is physically down with a hardware/software failure or because a
connecting interface is down.

[0045] In NNM 5.01, netmon would simply set the status of the interface to Critical using the ovtopmd API 116.
Ovtopmd 116 would then change the status of the interface in the topology database 118 and send out the
interfaceDown event.

[0046] As part of the current method, netmon analyzes the status of interfaces along the criticalRoute for the
interface in question (IIQ) and tries to determine which interface contains the hardware/software failure. As
previously mentioned, the interface that is inaccessible due to its own HW/SW failure is considered a
primaryFailure interface. The interfaces that are inaccessible due to the primaryFailure are considered
secondaryFailure interfaces. In this scenario we have the following classification of interfaces (assuming
interface B.1 has failed):

Interface    Failure Classification
N.1          Not failing.
A.1          Not failing.
A.2          Not failing.
B.1          Primary Failure.
B.2          Secondary Failure.
C.1          Secondary Failure.
C.2          Secondary Failure.
X.1          Secondary Failure.
Y.1          Secondary Failure.
Z.1          Secondary Failure.
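
The classification in the table above follows directly from the criticalRoute values once the failed interface is known. The sketch below is a minimal illustration under that assumption; `build_routes` and `classify` are hypothetical helper names, and in practice the monitor does not know the failed interface in advance but has to find it by polling.

```python
def build_routes():
    """criticalRoute values for the FIG. 1 example network."""
    chain = ["N.1", "A.1", "A.2", "B.1", "B.2", "C.1", "C.2"]
    routes = {iface: chain[: i + 1] for i, iface in enumerate(chain)}
    for leaf in ("X.1", "Y.1", "Z.1"):
        routes[leaf] = chain + [leaf]
    return routes


def classify(failed, routes):
    """Label every interface relative to a single known failed interface."""
    labels = {}
    for iface, route in routes.items():
        if iface == failed:
            labels[iface] = "primaryFailure"
        elif failed in route:
            # The failed interface lies on the path to this one.
            labels[iface] = "secondaryFailure"
        else:
            labels[iface] = "not failing"
    return labels


for iface, label in classify("B.1", build_routes()).items():
    print(f"{iface:4} {label}")
```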
Pre-criticalRouteWaitList Classification Algorithm

[0047] When netmon's ping of interface C.2 times-out, netmon examines the in-memory status of every interface
along the criticalRoute path for the IIQ (interface C.2). If any interface along this path is Down (Critical), then
the IIQ is a secondaryFailure interface. If the IIQ is a secondaryFailure interface, then netmon changes its status
to Unknown using the libtopm changeSecondaryFailureStatus() API call.

[0048] Ovtopmd 116 changes the status of the interface in the topology database 118 and emits a special
secondaryFailureInterfaceDown message as described in section 3.1.6. Then netmon puts the internal representation
of the interface on a slowPingList and continues its normal status poll processing of other interfaces on the network.
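
The in-memory check of paragraph [0047] needs no network traffic. A minimal sketch, assuming statuses are held in a plain dictionary keyed by interface name (the function name `pre_waitlist_classify` and the dictionary layout are illustrative, not part of any NNM API):

```python
def pre_waitlist_classify(iiq, route, status):
    """Return 'secondaryFailure' if an upstream interface is already Down.

    Returns None when the in-memory status cannot decide, in which case the
    route has to be verified with pings.
    """
    for iface in route:
        if iface != iiq and status.get(iface) == "Down":
            return "secondaryFailure"
    return None


route_c2 = ["N.1", "A.1", "A.2", "B.1", "B.2", "C.1", "C.2"]
print(pre_waitlist_classify("C.2", route_c2, {"B.1": "Down"}))
print(pre_waitlist_classify("C.2", route_c2, {}))
```

The first call resolves the interface in question immediately; the second returns None because nothing on the route is yet known to be Down.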

criticalRouteWaitList Classification Algorithm

[0049] If the pre-criticalRouteWaitList interface failure classification algorithm is unable to find an interface
(other than the IIQ) already down along the criticalRoute, then the status of all interfaces along the
criticalRoute(IIQ) must be verified to ensure that one of these interfaces has not failed since it was checked last.

[0050] To facilitate this, netmon saves the fact that interface C.2 is inaccessible by moving the representation
of this IIQ from the normal pingList to a new queue called the criticalRouteWaitList. This list has the following
characteristics:

• Any number of interfaces may be placed on the list.

• Only the first interface on the list is processed by the criticalRouteWaitList algorithm. All other interfaces
are just held.

• When an interface is on this list, it is not on any other list. This prevents processing of this interface by
other netmon activities.

• This list has data structures to allow sequential walking of the criticalRouteWaitList(IIQ). This is important
because of the threaded nature of netmon. After sending a ping to a particular interface on the criticalRoute(IIQ),
netmon performs other tasks while waiting for the ping response to return or for a time-out to occur.
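
The list characteristics above suggest a queue whose head carries a cursor into its criticalRoute, so that the walk can be resumed each time a ping reply or time-out arrives. The class below is a hypothetical sketch of such a structure; its name, methods and the use of `collections.deque` are assumptions, and the rule that an interface is on no other list at the same time is left to the caller.

```python
from collections import deque


class CriticalRouteWaitList:
    """Holds interfaces in question; only the head entry is being walked."""

    def __init__(self):
        self._held = deque()  # (iiq, route) pairs, oldest first
        self._cursor = 0      # position of the next ping on the head's route

    def add(self, iiq, route):
        self._held.append((iiq, list(route)))

    def next_to_ping(self):
        """Next interface to ping on behalf of the head IIQ, or None."""
        if not self._held:
            return None
        _, route = self._held[0]
        if self._cursor >= len(route):
            return None
        iface = route[self._cursor]
        self._cursor += 1
        return iface

    def pop_head(self):
        """Remove the head IIQ once it has been classified."""
        self._cursor = 0
        return self._held.popleft()[0]

    def __len__(self):
        return len(self._held)


wait_list = CriticalRouteWaitList()
wait_list.add("C.2", ["N.1", "A.1", "A.2", "B.1", "B.2", "C.1", "C.2"])
wait_list.add("Z.1", ["N.1", "A.1", "A.2", "B.1", "B.2", "C.1", "C.2", "Z.1"])
print(wait_list.next_to_ping(), wait_list.next_to_ping(), len(wait_list))
```

Keeping the cursor in the list rather than in a loop variable is what lets the monitor do other work between pings and pick up the walk where it left off.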

[0051] Netmon interrogates each interface along the criticalRoute on behalf of the interface in question (IIQ) by
sending a ping, one interface at a time beginning with the netmon node's interface (at the Management Station 110),
until it finds an interface down or it arrives back at the IIQ's inaccessible interface.

[0052] If an interface (other than the IIQ) is Down, then it is processed as a primaryFailure interface. The
interface's status is changed to Critical using the NNM 5.01 libtopm API. Ovtopmd changes the status of the
interface in the topology database 118 and emits a special primaryFailureInterfaceDown message. Then netmon moves
the internal representation of the interface from the criticalRouteWaitList to the slowPingList and continues its
normal status poll processing of other interfaces on the network 100.

[0053] If a primaryFailure (other than the IIQ) is found, then the failure of the IIQ is a secondaryFailure, and
is processed as described above for secondaryFailures. Netmon changes the status of the IIQ to Unknown using the
libtopm secondaryFailure() API call. Ovtopmd 116 changes the status of the interface in the topology database 118
and emits a special secondaryFailureInterfaceDown message as described below. Then netmon puts the internal
representation of the interface on the slowPingList and continues its normal status poll processing of other
interfaces on the network 100.

[0054] If no primaryFailure interface along the criticalRoute(IIQ) can be found, then the criticalRouteWaitList
processing eventually arrives back at the IIQ interface. When this occurs we know that the IIQ is a primaryFailure.
Regular primaryFailure processing occurs for the IIQ.

[0055] The interface's status is changed to Critical using the NNM 5.01 libtopm API. Ovtopmd 116 changes the
status of the interface in the topology database 118 and emits a special primaryFailureInterfaceDown message as
described below. Then netmon moves the internal representation of the interface from the criticalRouteWaitList to
the slowPingList and continues its normal status poll processing of other interfaces on the network.
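
The walk described in paragraphs [0051] to [0055] can be sketched as a single loop over the criticalRoute. The code below is a simplified, synchronous illustration: `verify_route` and the `ping` callable are hypothetical, the real monitor interleaves these pings with other work, and the "up" result for an interface in question that answers after all is an assumption not covered by the text.

```python
def verify_route(iiq, route, ping):
    """Ping along the route from the monitor outward; stop at the first failure.

    Returns (failed_interface, classification_of_iiq).
    """
    for iface in route:
        if not ping(iface):
            if iface == iiq:
                # Arrived back at the IIQ with everything upstream answering.
                return iface, "primaryFailure"
            # An upstream interface is down: it is the primaryFailure.
            return iface, "secondaryFailure"
    return None, "up"  # assumption: every ping, including the IIQ, answered


route_c2 = ["N.1", "A.1", "A.2", "B.1", "B.2", "C.1", "C.2"]
unreachable = {"B.1", "B.2", "C.1", "C.2"}
print(verify_route("C.2", route_c2, lambda iface: iface not in unreachable))
```

With B.1 failed, the walk stops at B.1 and reports the interface in question, C.2, as a secondaryFailure, matching the classification table for FIG. 1.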

Popping the criticalRouteWaitList

[0056] While the criticalRouteWaitList is being processed, netmon continues to poll other interfaces in the
network 100 as controlled by the pingList. Some of these may be inaccessible and wind up on the
criticalRouteWaitList by the algorithms above. Many of these interfaces could be secondaryFailure interfaces that
are due to the same primaryFailure interface. It would be very inefficient to verify the status of the entire
criticalRoute for each secondaryFailure interface.

[0057] By the time the first secondaryFailure interface has been identified and processed, the primaryFailure
interface has also been identified and processed. This means that it is now possible to determine if the new IIQ
(IIQ2) is a primaryFailure interface or a secondaryFailure interface by examining the in-memory status of each
interface along the criticalRoute(IIQ2) using the pre-criticalRouteWaitList classification algorithm, and avoid
wasting time sending additional pings.

[0058] If it can be determined that IIQ2 is a secondaryFailure, then secondaryFailure processing occurs, the
interface is moved from the criticalRouteWaitList and placed in the slowPingList, and no verification of the
criticalRoute status is required. Otherwise, processing for the IIQ2 continues similarly to processing of the IIQ.
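
The saving described in paragraphs [0056] to [0058] can be made concrete by counting pings while a wait list is drained. The sketch below is a simplified, single-threaded model: `build_routes`, `drain_wait_list` and the dictionary-based status store are hypothetical, and the event emission performed by ovtopmd is omitted.

```python
def build_routes():
    """criticalRoute values for the FIG. 1 example network."""
    chain = ["N.1", "A.1", "A.2", "B.1", "B.2", "C.1", "C.2"]
    routes = {iface: chain[: i + 1] for i, iface in enumerate(chain)}
    for leaf in ("X.1", "Y.1", "Z.1"):
        routes[leaf] = chain + [leaf]
    return routes


def drain_wait_list(wait_list, routes, ping):
    """Classify each queued IIQ, pinging only when memory cannot decide."""
    status = {}          # in-memory status: "Down" (primary) or "Unknown"
    slow_ping_list = []
    pings_sent = 0
    for iiq in wait_list:
        route = routes[iiq]
        # Pre-criticalRouteWaitList check: is an upstream interface known Down?
        if any(status.get(i) == "Down" for i in route if i != iiq):
            status[iiq] = "Unknown"
            slow_ping_list.append(iiq)
            continue
        # Otherwise verify the route with pings, monitor side first.
        for iface in route:
            pings_sent += 1
            if ping(iface):
                continue
            status[iface] = "Down"
            slow_ping_list.append(iface)
            if iface != iiq:
                status[iiq] = "Unknown"
                slow_ping_list.append(iiq)
            break
    return status, slow_ping_list, pings_sent


reachable = {"N.1", "A.1", "A.2"}  # B.1 has failed
result = drain_wait_list(["C.2", "C.1", "Z.1"], build_routes(),
                         lambda iface: iface in reachable)
print(result)
```

In this run only the first interface in question costs any pings; the two that follow are classified from memory, which is the efficiency the paragraphs above describe.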

interfaceDown events for primaryFailures and secondaryFailures

[0059] Discussions above suggest that the ovtopmd daemon 116 is responsible for changing the status of network
elements 124, 128-136 in the topology database 118 and sending out related events. It does so on behalf of other
processes when they instruct it to using the libtopm API.

[0060] There is no change to this API for primaryFailure events and primaryFailure events use the NNM 5.01 event
format without change. However, for secondaryFailure events, information about the secondaryFailure must be
communicated, and in addition, the primaryFailure interface must be identified. This added requirement is necessary
so that an event correlation system such as ECS 408 (FIG. 4; distributed by Hewlett-Packard Company of Palo Alto,
California) can distinguish between the two types of events and correlate and/or suppress them.

[0061] This is accomplished by adding an additional var-bind (SNMP var-binds are described in detail in "The Simple
Book" by Marshall Rose) to the regular primaryFailure event format. This extra var-bind is called
...primaryFailureUuid and contains the event UUID (Universally Unique IDentifier) of the corresponding
primaryFailure event. A UUID is a handle that is unique to each event across any computer in a network. There is a
separate API call, changeSecondaryFailureStatus(), in libtopm which is used to change the status of the network
element. The parameters of the API call are identical to the call used for primaryFailures, plus the addition of
the ovwDbId of the primaryFailure network element (and the behavior is different).

[0062] When the primaryFailure API call changeStatus() is made, ovtopmd 116 changes the status in the topology
database 118 and sends out the appropriate event as in NNM 5.01. In addition, it records in its process memory the
UUID of the Down event.

[0063] When the secondaryFailure API call changeSecondaryFailureStatus() is made, ovtopmd 116 changes the
status in the topology database 118, and constructs an event as in NNM 5.01. In addition, it takes the ovwDbId
parameter, looks up the corresponding primaryFailure event UUID, and creates a primaryFailureUuid var-bind to include
in the secondaryFailure event. Then it emits the event. The UUID of the secondaryFailure event is recorded in
ovtopmd's process memory for possible later use.
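
The bookkeeping described in the preceding paragraphs can be sketched as follows. This is only an illustrative model under stated assumptions: the class and method names (Event, StatusChanger, change_status, change_secondary_failure_status) are hypothetical and are not the libtopm API, and the topology database update is omitted. The sketch shows only how the UUID of a primaryFailure event could be remembered and later attached to a secondaryFailure event as a primaryFailureUuid var-bind.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical stand-in for a status-change event."""
    ovw_db_id: int
    status: str
    event_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))
    var_binds: Dict[str, str] = field(default_factory=dict)


class StatusChanger:
    """Hypothetical model of the daemon's in-memory bookkeeping."""

    def __init__(self) -> None:
        # Maps the ovwDbId of an element to the UUID of its last emitted event.
        self.event_uuid_by_element: Dict[int, str] = {}

    def change_status(self, ovw_db_id: int, status: str) -> Event:
        """primaryFailure path: regular event format, UUID remembered."""
        event = Event(ovw_db_id, status)
        self.event_uuid_by_element[ovw_db_id] = event.event_uuid
        return event

    def change_secondary_failure_status(
        self, ovw_db_id: int, status: str, primary_ovw_db_id: int
    ) -> Event:
        """secondaryFailure path: same event plus the extra var-bind."""
        event = Event(ovw_db_id, status)
        # Look up the primaryFailure event UUID and attach it as a var-bind.
        event.var_binds["primaryFailureUuid"] = (
            self.event_uuid_by_element[primary_ovw_db_id]
        )
        # The secondaryFailure UUID is also kept for possible later use.
        self.event_uuid_by_element[ovw_db_id] = event.event_uuid
        return event
```

In this sketch a primaryFailure event carries no extra var-bind, which is the property a downstream correlator could use to tell the two event types apart.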

slowPingList

[0064] The slowPingList allows netmon to perform its polling of Down interfaces without getting behind on interfaces
which are Up. By segregating Down interfaces (which are likely to still be Down the next time they are polled) from Up
interfaces, netmon will be able to alert the network administrator of transitions from Up to Down in a timely fashion.
Netmon will also attempt fewer retries for interfaces on the slowPingList, thereby limiting the time and network
bandwidth "wasted" performing these operations.
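
A minimal sketch of the two-list idea follows. It assumes a caller-supplied ping(interface, retries) function and arbitrary retry counts (3 and 1); the class name Poller, the list names, and the retry values are illustrative assumptions, not details taken from netmon.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Poller:
    # ping(interface, retries) returns True if the interface responded.
    ping: Callable[[str, int], bool]
    normal_retries: int = 3   # assumed value for interfaces believed Up
    slow_retries: int = 1     # assumed value; fewer retries for Down interfaces
    ping_list: List[str] = field(default_factory=list)
    slow_ping_list: List[str] = field(default_factory=list)

    def poll_up_interfaces(self) -> List[str]:
        """Poll interfaces believed Up; move non-responders to the slow list."""
        newly_down = [i for i in self.ping_list
                      if not self.ping(i, self.normal_retries)]
        for iface in newly_down:
            self.ping_list.remove(iface)
            self.slow_ping_list.append(iface)
        return newly_down

    def poll_down_interfaces(self) -> List[str]:
        """Poll the slow list with fewer retries; move responders back."""
        recovered = [i for i in self.slow_ping_list
                     if self.ping(i, self.slow_retries)]
        for iface in recovered:
            self.slow_ping_list.remove(iface)
            self.ping_list.append(iface)
        return recovered
```

Because the two lists are walked separately, the slow list can be scheduled less often than the normal list, so a long run of timeouts on Down interfaces does not delay the detection of new Up-to-Down transitions.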

PMD/ECS Event Distribution 

[0065] FIG. 1 is a simplified illustration of the NNM 5.01 architecture, including an event system bus 114. An event
system bus 114 may not actually exist. Rather, a socket connection from every communicating tool 120, 202-206 to and
from the PMD (Post Master Daemon) process 102 may exist (See FIG. 2). A sender 202 sends an event to the PMD
process 102 and the PMD 102 distributes the event to every listener 120, 204, 206. The connections may be
bidirectional so that every process 102, 120, 202-206 can be a listener and an originator.
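
The fan-out role of the PMD can be pictured with the following in-process sketch. It is a simplification under stated assumptions: real connections are sockets, whereas here each listener is a plain callback, and the names PostMaster, connect and send are hypothetical.

```python
from typing import Callable, List

Listener = Callable[[str], None]


class PostMaster:
    """Hypothetical in-process model of an event distributor."""

    def __init__(self) -> None:
        self._listeners: List[Listener] = []

    def connect(self, listener: Listener) -> None:
        """Register a process that wants to receive events."""
        self._listeners.append(listener)

    def send(self, event: str) -> None:
        """Distribute one event to every connected listener."""
        for listener in list(self._listeners):
            listener(event)
```

A process that both calls send() and registers a callback with connect() acts as originator and listener at once, which mirrors the bidirectional connections described above.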

[0066] In the preferred embodiment of the system presented herein, the Event System 200 is enhanced by
incorporating an Event Correlation System (ECS) 408 into the PMD process 406 (See FIG. 4). ECS 408 is yet another
product distributed by Hewlett-Packard Company. This product is described in detail in the "ECS 2.0 Designer's
Reference Manual" identified by HP part number J1095-90203. This manual is hereby incorporated by reference for all
that it discloses. FIG. 4 illustrates the PMD/ECS architecture 502. All events that flow into the PMD 406 flow into the
ECS engine 408, which can manipulate the events in the following ways:

• Events can simply pass through the ECS engine 408 unaltered.

• Events can be stored in the ECS engine 408 for a period of time and released later.

• Events can be suppressed by the ECS engine 408. That is, they come in but do not flow out.

• Events can be correlated with other events independent of suppression. That is, an attribute is attached to the
event that specifies a parent event. This facilitates the new Drill Down functionality in the Events Browser 120.

• New events can be generated in addition to or in place of an event that enters ECS 408.

• Events can be saved in ECS 408 for an extended period of time and used as state information to help interpret the
meaning of subsequent events.

• Events can trigger queries of data external to PMD 406 and the ECS engine 408, to help interpret the meaning of
the current event.

[0067] The Events Browser 120, some NNM processes, and most 3rd party tools connect to the Correlated Event Bus
402. This bus 402 may have some events suppressed or delayed if the logic in the engine's circuits 408 is designed to
do so. Many of the original NNM processes 204/206 connect to the Raw + Correlated Event Bus 404 so that they can
see all events in the system.

[0068] It is beyond the scope of this document to describe in detail how the logic in the ECS engine 408 is organized.
In brief, the logic is organized in two layers. The first layer is a graphical data flow design (ECS Circuit) constructed
using a developer GUI and a suite of ECS circuit nodes of various types with various functionality. Each node of the
circuit can contain a computer program written in the Event Correlation Description Language (ECDL). One of the ECS
circuit elements is called an Annotate Node 412 and is used to query the corresponding Annotation Server 510 external
to ECS 408 for data that is not contained in ECS 408.

User Configuration Attributes

[0069] The above-described Event System can be configured by the user to obtain one of many possible behaviors,
with various trade-offs in performance and usability. The following list describes the main configurable attributes:

• Critical_Node_Filter_Name (String)

This netmon parameter specifies a topology filter which allows netmon to distinguish between a critical node and a
regular node. Events for critical nodes may be correlated but are never suppressed by ECS 408 or netmon, even if
the failure is secondary.

• Critical_Node_Sec_Status (Down | Unknown)

This netmon parameter describes the new status to use for the changeStatus event for a critical node with a
secondaryFailure. PrimaryFailures always receive status Down, regardless of whether the node is critical or regular.

• Normal_Node_Sec_Status (Down | Unknown | Ignore)

This netmon parameter describes the new status to use for the changeStatus event for a regular node
secondaryFailure. If the value is Ignore, then the status of the node is not changed; it will remain on the map as Up
even though it is inaccessible.

• Sec_Fail_Event_Suppress_Switch (False | True)

This netmon boolean parameter is communicated to ECS 408 via a Router Down annotation server interface 410,
and informs ECS 408 whether to suppress secondaryFailures for normal nodes (i.e., non-critical nodes).

[0070] Discussions in this document that reference these user parameters will refer to them in the order above. For
example, [< >, Down, Ignore, True] indicates no filter, Critical_Node_Sec_Status=Down,
Normal_Node_Sec_Status=Ignore and Sec_Fail_Event_Suppress_Switch=True.
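
The four attributes and their bracketed shorthand can be modeled as a small value type. This is a sketch only: the Python names (SecStatus, EventConfig, shorthand) are hypothetical, and representing "no filter" as None is an assumption made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SecStatus(Enum):
    DOWN = "Down"
    UNKNOWN = "Unknown"
    IGNORE = "Ignore"


@dataclass(frozen=True)
class EventConfig:
    critical_node_filter_name: Optional[str]  # None models "no filter"
    critical_node_sec_status: SecStatus       # Down or Unknown only
    normal_node_sec_status: SecStatus         # Down, Unknown or Ignore
    sec_fail_event_suppress_switch: bool

    def __post_init__(self) -> None:
        # Ignore is listed as a legal value only for regular nodes.
        if self.critical_node_sec_status is SecStatus.IGNORE:
            raise ValueError("Critical_Node_Sec_Status must be Down or Unknown")

    def shorthand(self) -> str:
        """Render the configuration in the bracketed order used in the text."""
        name = self.critical_node_filter_name or " "
        return "[<{}>, {}, {}, {}]".format(
            name,
            self.critical_node_sec_status.value,
            self.normal_node_sec_status.value,
            self.sec_fail_event_suppress_switch,
        )
```

For instance, EventConfig(None, SecStatus.DOWN, SecStatus.IGNORE, True).shorthand() renders the example configuration given in the paragraph above.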

Behavior with [< >, Down, Unknown, True] Configuration

[0071] FIG. 1 illustrates the system behavior for NNM 5.01. FIG. 5 illustrates the system behavior for the system
disclosed herein, wherein the [< >, Down, Unknown, True] configuration is selected. In this system, netmon has
recognized that interface B.1 is the primaryFailure, and interfaces B.2, C.1, C.2, X.1, Y.1 and Z.1 are
secondaryFailures. Since B.1 is a primaryFailure interface, it is given status Critical, and displayed as red in ovw 104.
An Interface B Down event is also emitted.

[0072] Since no filter has been specified (Critical_Node_Filter_Name=" "), all nodes are considered normal (i.e., no
nodes are considered Critical), and the Critical_Node_Sec_Status attribute is not used. Since
Normal_Node_Sec_Status=Unknown, all secondaryFailure interfaces on all nodes 124, 128-136 are displayed as
blue to represent a status of Unknown (i.e., as a slashed circle in display 108 of FIG. 5).

[0073] The changeStatus Unknown event is emitted by netmon/ovtopmd 110/116 for all secondaryFailure interfaces
once netmon recognizes them as secondaryFailure interfaces. However, ECS 408 suppresses the events because
Sec_Fail_Event_Suppress_Switch=True. Therefore, the secondaryFailure events do not show up in the
xnmevents.web 120 Event Browser at the top level display 522.

[0074] New functionality in the Event Browser allows the user to invoke a menu option that brings up
secondaryFailures associated (correlated) with the selected top level event. In this case, selecting "Interface B.1 Down"
and invoking "Show Correlated Events" brings up another dialog showing the related secondaryFailure events.

[0075] If one compares FIG. 1 to FIG. 5, one will appreciate that problems outlined in the Background section of this
disclosure are solved by the FIG. 5 architecture and configuration. The ovw display 104 identifies the working nodes
and interfaces (in green), the primaryFailures (in red), and all secondaryFailures (in blue). The event browser display
522 is uncluttered with secondaryFailures 524, and easily identifies the interface requiring maintenance to the NA.

Critical_Node_Filter_Name and the ECS Annotation Server

[0076] The Critical_Node_Filter_Name attribute can be specified by the user and will define two classes (critical and
regular) of network nodes using the NNM 5.01 networking filter language. This language allows users to describe a
subset of elements in the topology database 118 based upon attributes in the database. For example, one could easily
specify a group consisting of all routers 124, 128 and nodes with ipAddress 15.1.2.*.

[0077] This mechanism is provided to allow the user to identify network elements whose accessibility is essential to
the productivity of the organization. For example, routers 124, 128 and servers might be critical but workstations 132-
136 and PCs might not. Interfaces that are inaccessible and are primaryFailures are always given a status of Down
regardless of which class they belong to. However, if an interface is inaccessible and is a secondaryFailure, then a
critical node filter can be used to influence the behavior of the system.

[0078] If an interface is inaccessible and is located on a critical node (as defined by the filter and evaluated by
netmon), then the Critical_Node_Sec_Status attribute value defines the status that will actually be given to the
interface. The possible values are Down and Unknown. This attribute is evaluated by netmon, which is the entity that
instructs ovtopmd 116 to change the status of the interface.

[0079] If an interface is inaccessible and is located on a regular node (as defined by the filter and evaluated by
netmon), then the Normal_Node_Sec_Status attribute value defines the status that will actually be given to the
interface. The possible values are Down, Unknown and Ignore. Again, this attribute is evaluated by netmon, which is the
entity that instructs ovtopmd 116 to change the status of the interface.

[0080] A value of Down or Unknown results in behavior for regular nodes that is analogous to the behavior of critical
nodes. However, a value of Ignore instructs netmon to ignore this inaccessible interface on a regular node. That is, do
not change the status of the interface and do not send out any events regarding this interface. It will remain on the map
as Up even though it is inaccessible.
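
The status rules for inaccessible interfaces described above reduce to a small decision function. The sketch below is illustrative only: the function name and parameters are hypothetical, statuses are plain strings, and a return value of None is used here to mean "leave the status unchanged and emit no event".

```python
from typing import Optional


def status_for_inaccessible_interface(
    is_primary_failure: bool,
    on_critical_node: bool,
    critical_node_sec_status: str,  # "Down" or "Unknown"
    normal_node_sec_status: str,    # "Down", "Unknown" or "Ignore"
) -> Optional[str]:
    """Return the new status, or None if the interface is to be ignored."""
    if is_primary_failure:
        # primaryFailures are always Down, whatever the node class.
        return "Down"
    if on_critical_node:
        return critical_node_sec_status
    if normal_node_sec_status == "Ignore":
        # Regular node, secondaryFailure, Ignore: no change, no event.
        return None
    return normal_node_sec_status
```

Keeping the rule in one place makes the asymmetry visible: Ignore is reachable only for a secondaryFailure on a regular node.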

[0081] This configuration is useful when it is desirable to minimize network traffic and NNM processing load for
nodes that are not essential to the productivity of the organization. In this situation, netmon will still put the interface
on the slowPingList so that network traffic is minimized further and netmon status polling remains on schedule.

[0082] This filter is used by netmon (as described above) when it discovers that a secondaryFailure interface is Down.
It is also needed by the ECS engine 408 to determine if the corresponding event should be suppressed. Since netmon
is already set up to distinguish between critical and regular nodes, it makes sense to let netmon communicate this
distinction to the Router Down circuit in the ECS engine 408.

[0083] It does this via the Annotation Server mechanism 412 provided by ECS 408. Whenever a circuit in ECS 408
needs to know if an event it received corresponds to a critical node or a regular node, the event flows into a
corresponding Annotate Circuit Node 412, which sends the query to the Annotation Server process 510 using UNIX®
Domain sockets. A mechanism other than UNIX® Domain sockets is used on Windows® NT.

[0084] The query arrives at the Router Down Annotation Server 510, which runs the ovwDbId argument through its
filter evaluator and sends the boolean result back to the Annotate Circuit Node 412 in the ECS circuit 408. This
particular Annotation Server is built into netmon (See FIG. 4).

[0085] Note that in a distributed system with multiple netmons running, the Critical_Node_Filter_Name attribute on
the Management Station 510 may be different than the value on a Collection Station.

ECS Router Down Circuit

[0086] Although the ECS engine 408 has considerable power, very little of its potential is used by the Router Down 
circuit because most of the logic has been placed in the netmon process for performance reasons. FIG. 6 illustrates the 
circuit logic 600. 

[0087] If the event is not a changeStatus event (nodeDown or interfaceDown) 602/604, or if the event is a
primaryFailure (because no extra var-binds are present) 606, then the event is immediately passed on to other ECS
circuit elements 608/626. Otherwise, the event flows into the secondaryFailure path 610 where it is correlated with the
primaryFailure event 612. This correlation is nothing more than logging an attribute in a log file that identifies the parent
event of the current event. This is possible because the event UUID of the parent event (the primaryFailure event) is in
the extra var-bind. This correlation facilitates Drill Down with the Events Browser.

[0088] At this point the circuit needs to decide if the event should be suppressed. If the event corresponds to a critical
node 620, the event will not be suppressed because it is important that the NA know that this important server or router
be repaired immediately 618. The circuit determines whether the node is critical or regular by querying the Router
Down Annotation server embedded in netmon 616.

[0089] If the event corresponds to a regular node 622, then the circuit examines the value of the
Sec_Fail_Event_Suppress_Switch attribute and behaves accordingly 624/626/628. All of these attributes are
configured in netmon's configuration. However, this attribute is not actually used by netmon; it is only used by the ECS
circuit 600. Therefore, the value of this attribute is also communicated to ECS via the Annotation Server interface.
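
The Router Down circuit logic of the last three paragraphs can be sketched as a single function. This is an illustrative model, not ECDL and not the actual circuit: events are plain dictionaries, the key names ("type", "ovwDbId", "primaryFailureUuid") are assumptions, and the annotation server query is modeled as a callback.

```python
from typing import Callable, Dict, Optional, Tuple


def router_down_circuit(
    event: Dict[str, object],
    is_critical_node: Callable[[object], bool],  # models the annotation query
    suppress_switch: bool,                       # Sec_Fail_Event_Suppress_Switch
) -> Tuple[bool, Optional[object]]:
    """Return (passed_on, parent_uuid) for one incoming event."""
    # Events other than nodeDown/interfaceDown pass straight through.
    if event.get("type") not in ("nodeDown", "interfaceDown"):
        return True, None
    parent_uuid = event.get("primaryFailureUuid")
    # No extra var-bind means a primaryFailure: pass straight through.
    if parent_uuid is None:
        return True, None
    # secondaryFailure: always correlated with its parent event.
    if is_critical_node(event["ovwDbId"]):
        return True, parent_uuid   # critical nodes are never suppressed
    # Regular node: suppression depends on the configured switch.
    return (not suppress_switch), parent_uuid
```

Note that in this sketch correlation (recording the parent UUID) happens whether or not the event is suppressed, which is what allows suppressed events to be retrieved later through Drill Down.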

Behavior with [<serverFilter>, Down, Unknown, True] Configuration

[0090] FIG. 7 illustrates system behavior with the [<serverFilter>, Down, Unknown, True] configuration. This
configuration differs from the FIG. 5 configuration because the user has specified a filter using the
Critical_Node_Filter_Name attribute. The filter in FIG. 7 has been designed to identify node Z as a critical node. For
example, the productivity of the organization may be dependent on the availability of an application server running on
node Z.

[0091] In this scenario, netmon has recognized that interface B.1 is the primaryFailure, and interfaces B.2, C.1, C.2,
X.1, Y.1 and Z.1 are secondary. Since B.1 is a primaryFailure interface, it is given status Critical, displayed as red in
ovw, and an Interface B Down event is emitted.

[0092] Since interface Z.1 is a secondaryFailure interface located on a critical node, it is given a status specified by
the Critical_Node_Sec_Status attribute, which has been configured to have a value of Down. Node Z and interface Z.1
are displayed as red in ovw, and an Interface Z.1 Down event is emitted.

[0093] All remaining secondaryFailure interfaces are given the status specified by the Normal_Node_Sec_Status
attribute, which has a value of Unknown. These interfaces are displayed in blue to represent a status of Unknown.

[0094] The changeStatus Unknown event is emitted by netmon/ovtopmd for all non-critical secondaryFailure
interfaces once netmon recognizes them as secondaryFailure interfaces. However, ECS 408 suppresses the events
because Sec_Fail_Event_Suppress_Switch=True. Therefore, the secondaryFailure events that do not correspond to
critical nodes do not show up in the xnmevents.web Event Browser 120 at the top level display 722.

[0095] New functionality in the Event Browser 120 allows the user to invoke a menu option that brings up
secondaryFailures 724 associated (correlated) with the selected top level event. In this case, selecting "Interface B.1
Down" and invoking "Show Correlated Events" brings up another dialog 724 showing the related secondaryFailure
events. Notice that interface Z.1 shows up in the top level display 722 and in the Drill Down display 724. It shows up in
the top level display 722 because it resides on a node that is critical. It shows up in the Drill Down display 724 because
it is inaccessible due to the failure of interface B.1.

[0096] In comparing FIG. 1 to FIG. 7, one can appreciate that all three requirements identified in the Background
section of this disclosure are now met by the architecture and configuration of this invention. The ovw display 104/108
identifies the working nodes and interfaces (in green), the primaryFailure interfaces (in red), the critical
secondaryFailure interfaces (in red) and all regular secondaryFailures (in blue). The event browser display 722 is
uncluttered with non-critical secondaryFailures, and easily identifies an interface requiring maintenance to the NA.

Behavior with [<serverFilter>, Unknown, Ignore, True] Configuration

[0097] FIG. 8 illustrates system behavior with a [<serverFilter>, Unknown, Ignore, True] configuration. This
configuration differs from the configuration shown in FIG. 7 in that a user has specified that secondaryFailures of
critical nodes should be given a status of Unknown, and secondaryFailures on regular nodes should be ignored.

[0098] This configuration has advantages and disadvantages. The main advantage is that system and network
performance is very good because fewer status changes and events are generated on Collection and/or Management
Stations. The ovw display 104/108 and the event browser display 822 communicate more of the impact of the failure
because primaryFailures and secondaryFailures of interfaces of critical nodes are in different colors, and the clutter of
unimportant secondaryFailures is not shown.

[0099] The disadvantage is that secondaryFailures of unimportant nodes are displayed as accessible when they are
not. This is a trade-off that the network administrator must evaluate when he or she chooses this configuration.

[0100] In this scenario, netmon recognizes that interface B.1 is a primaryFailure, and interfaces B.2, C.1, C.2, X.1, Y.1
and Z.1 are secondary. Since B.1 is a primaryFailure interface, it is given status Critical, and is displayed as red in ovw
104/108. Furthermore, an Interface B Down event is emitted.

[0101] Since interface Z.1 is a secondaryFailure residing on a critical node, it is given a status specified by the
Critical_Node_Sec_Status attribute, which has been configured to have a value of Unknown. Interface Z.1 is
displayed as blue in ovw, and an Interface Z.1 Unknown event is emitted.

[0102] All remaining secondaryFailure interfaces are ignored. No status changes occur and no events are emitted.
They continue to be displayed as green, representing a status value of Up. However, netmon still goes into its back-off
polling mode.

[0103] Notice that FIG. 8 shows several interface Up events, which collectively make the event display cluttered. This
illustration is somewhat misleading. These interface Up events are shown to illustrate that initially, when netmon
discovers the nodes and interfaces, the interface Up events will be transmitted. However, that should only happen once.
It will never happen again unless the node is physically moved on the network.

[0104] During typical operation, the NA will only see the two events, Interface B.1 Down and Interface Z.1 Unknown.
Similarly, the interface Up events in FIGS. 1, 5 & 7 will only happen when the nodes 124, 128-136 are discovered. The
displays in each of these four scenarios will be uncluttered and useful to the NA as they try to pinpoint a faulty network
element.

[0105] New functionality in the Event Browser 120 allows the user to invoke a menu option that brings up
secondaryFailures associated (correlated) with the selected top level event. In this case, selecting "Interface B.1 Down"
and invoking "Show Correlated Events" brings up another dialog showing the related secondaryFailure events. Notice
that interface Z.1 again shows up in the top level display 822, and in the Drill Down display 824. It shows up in the top
level display 822 because it resides on a node that is considered critical. It shows up in the Drill Down display 824
because it is inaccessible due to the failure of interface B.1.

Behavior with [< >, Down, Down, False] Configuration

[0106] This configuration forces a system to behave very similarly to NNM 5.01 because all inaccessible interfaces are
given a status of Down, displayed as red in ovw, and no events are suppressed (See FIG. 1). The system behavior
differs in the following ways:

• SecondaryFailure interfaces are still recognized by netmon and their node Down and interface Down events will
contain the extra var-bind.

• Even though the secondaryFailures are not suppressed, they are still correlated with the primaryFailure, and are
also visible via Drill Down.

• The back-off polling algorithm still provides performance improvements (i.e., the slowPingList is used).

Distributed Architecture

[0107] The system and method described herein is readily adaptable to a distributed environment comprising both
Collection and Management Stations. An exemplary distributed environment to which the disclosed system and method
are readily adaptable is described in the United States patent application of Eric Pulsipher et al. filed August 29, 1996,
Serial Number 08/705,358, entitled "Distributed Internet Monitoring System and Method".

[0108] While illustrative and presently preferred embodiments of the invention have been described in detail herein,
it is to be understood that the inventive concepts may be otherwise variously embodied and employed, and that the
appended claims are intended to be construed to include such variations, except as limited by the prior art.

Claims 

1. A network monitor (510) for distinguishing between broken and inaccessible network elements, comprising:

a) one or more computer readable storage mediums; and

b) computer readable program code stored in the one or more computer readable storage mediums, the
computer readable program code comprising:

i) code for discovering the topology of a plurality of network elements (124, 128-136);

ii) code for periodically polling a plurality of network interfaces associated with the plurality of network
elements;

iii) code for computing or validating a criticalRoute attribute for each of the plurality of network interfaces;
and

iv) code for analyzing a status of network interfaces identified by the criticalRoute attribute of an interface
in question (IIQ) which is not responding to a poll.

2. A network monitor (510) as in claim 1, further comprising code for establishing a slowPingList and placing
in-memory representations of broken or failed network interfaces thereon.

3. Network management apparatus for distinguishing between broken and inaccessible network elements, and for
presenting this information to a network administrator in an easy to comprehend format, comprising:

a) a display process (104 or 120); and

b) a network monitor (510) comprising:

i) means for discovering the topology of a plurality of network elements (124, 128-136) connected thereto;

ii) means for periodically polling a plurality of network interfaces associated with the plurality of network
elements;

iii) means for computing or validating a criticalRoute attribute for each of the plurality of network interfaces;
and

iv) means for analyzing a status of network interfaces identified by the criticalRoute attribute of an interface
in question (IIQ) which is not responding to a poll;

wherein the display process and network monitor communicate by way of one or more event buses (114).

4. Network management apparatus as in claim 3, further comprising:

a) an event correlation system (408) comprising an annotation node (408); and

b) an annotation server (510);

wherein the one or more event buses (114) comprise a correlated event bus (402), the display process
(120) receives event data over the correlated event bus, and the annotation server and the annotation node
communicate by way of a communication channel (410);

wherein the annotation server is configured to process the following annotation information:

i) an annotation providing for critical node recognition;

ii) an annotation indicating how critical node secondaryFailures should be processed and displayed;

iii) an annotation indicating how regular node secondaryFailures should be processed and displayed; and

iv) an annotation providing for regular node secondaryFailure suppression.

5. Network management apparatus as in claim 4, wherein:

a) the event correlation system comprises a means for suppressing events so they do not appear on the
correlated events bus (402); and

b) the display process (120) comprises a drill down interface (522, 722, 822) through which various suppressed
events may be called up for viewing.

6. A computer implemented method of distinguishing between broken and inaccessible network elements, and for
presenting this information to a network administrator in an easy to comprehend format, comprising the steps of:

a) discovering the topology of a plurality of network elements (124, 128-136);

b) periodically polling a plurality of network interfaces associated with the plurality of network elements;

c) computing or validating a criticalRoute attribute for each of the plurality of network interfaces; and

d) analyzing a status of network interfaces identified by the criticalRoute attribute of an interface in question
(IIQ) which is not responding to a poll.

7. A method as in claim 6, wherein the step of analyzing the status of network interfaces identified by the criticalRoute
attribute of an IIQ which is not responding to a poll comprises the steps of:

a) first examining in-memory statuses of one or more network interfaces to determine whether a network
interface identified by the criticalRoute attribute for the IIQ is Down, and if so, identifying that network interface as
a primaryFailure interface;

b) second, if a network interface identified by the criticalRoute attribute for the IIQ has not been identified as a
primaryFailure interface, verifying the statuses of one or more network interfaces identified by the criticalRoute
attribute for the IIQ to determine if one of them is a primaryFailure interface; and

c) third, if a network interface identified by the criticalRoute attribute for the IIQ has not been identified as a
primaryFailure interface, identifying the IIQ as a primaryFailure interface.

8. A method as in claim 6, further comprising the step of maintaining a criticalRouteWaitList, wherein a
pre-criticalRouteWaitList algorithm comprises the steps of:

a) examining in-memory statuses of one or more network interfaces to determine whether a network interface
identified by the criticalRoute attribute for the IIQ is Down, and if so, identifying that network interface as a
primaryFailure interface; and

b) if the above step successfully identifies a network interface as a primaryFailure interface,

i) identifying the IIQ as a secondaryFailure interface;

ii) emitting a secondaryFailureInterfaceDown event; and

iii) placing an in-memory representation of the IIQ on a slowPingList.

9. A method as in claim 8, wherein a criticalRouteWaitList algorithm comprises the steps of:

a) if none of the network interfaces identified by the criticalRoute attribute for the IIQ have been identified as a
primaryFailure interface, verifying the statuses of one or more network interfaces identified by the criticalRoute
attribute for the IIQ by:

i) moving an in-memory representation of the IIQ, and all network interfaces identified by the criticalRoute
attribute for the IIQ, to a criticalRouteWaitList, and removing these network interfaces from all other lists
accessible to a network monitor executing this method;

ii) sequentially walking the criticalRouteWaitList, polling each network interface to determine if it is a
primaryFailure interface; and

b) if a network interface identified by the criticalRoute attribute for the IIQ has not been identified as a
primaryFailure interface, identifying the IIQ as a primaryFailure interface, otherwise identifying the IIQ as a
secondaryFailure interface.

10. A method as in claim 6, further comprising the step of using an IIQ's criticalRoute attribute to identify an IIQ as a
primaryFailure interface or a secondaryFailure interface.